author: Daniel Chen, Sara Altman id: analyze_data_with_python_using_posit_workbench_and_snowflake summary: Analyze Data with Python using Posit Workbench and Snowflake categories: Getting-Started environments: web status: Published feedback link: https://github.com/Snowflake-Labs/sfguides/issues tags: Getting Started, Data Science, Python, Posit Workbench, Native Applications
Duration: 1
In this guide, we’ll use Python to analyze data in Snowflake using the Posit Workbench Native App. You’ll learn how to launch the Posit Workbench Native App and use the available VS Code IDE. You’ll also learn how to use the Ibis library to translate Python code into SQL, allowing you to run data operations directly in Snowflake’s high-performance computing environment.
We’ll focus on a healthcare example by analyzing heart failure data. We’ll guide you through accessing the data and performing data cleaning, transformation, and visualization. Finally, you’ll see how to generate an HTML report, build an interactive Shiny app, and write data back to Snowflake, completing an end-to-end analysis in Python entirely within Snowflake.
Along the way, you will use Python to analyze which variables are associated with survival among patients with heart failure. You can follow along with this quickstart guide, or look at the materials provided in the accompanying repository: https://github.com/posit-dev/snowflake-posit-quickstart-python.
Duration: 15
Before we begin, there are a few components we need to prepare: the heart failure data, the Posit Workbench Native App, and a VS Code session with the necessary extensions.
For this analysis, we’ll use the Heart Failure Clinical Records dataset. The data is available for download as a CSV from the UCI Machine Learning Repository.
We’ll walk through how to download the data from UCI and then upload it to Snowflake from a CSV.
aside positive
If you have the necessary permissions in Snowflake, you can also import the data from this S3 bucket: s3://heart-failure-records/heart_failure.csv.
Download the data from UCI here, and then unzip the downloaded file.
Log into Snowsight,
then click Create > Add Data. You can find
the Create button in the upper-left corner.
Choose the Load Data into a Table option, then select
your downloaded heart failure CSV. Specify an existing database or
create a new one for the heart failure data (we called ours
HEART_FAILURE). Then, select
+ Create a new table and name it
HEART_FAILURE.
You should now be able to see the heart failure data in Snowsight.
Navigate to Data > Databases, then select
the database to which you added the data (e.g.,
HEART_FAILURE). Expand the database, schema, and tables
until you see the HEART_FAILURE table.
We can now start exploring the data using Posit Workbench. You can find Posit Workbench as a Snowflake Native Application and use it to connect to your database.
Once Posit Workbench is installed, click on the app under
Installed Apps to launch the app. If you do not see the
Posit Workbench app listed, ask your Snowflake account administrator for
access to the app.
After clicking on the app, you will see a page with configuration
instructions and a blue Launch app button.
Click on Launch app. This should take you to the webpage
generated for the Workbench application. You may be prompted to first
login to Snowflake using your regular credentials or authentication
method.
Posit Workbench provides several IDEs, including VS Code, RStudio Pro, and JupyterLab. For this analysis we will use VS Code.
Within Posit Workbench, click New Session to launch a
new session.
When prompted, select VS Code.
Next, connect to your Snowflake account from within Posit Workbench.
Under Session Credentials, click the button with the
Snowflake icon to sign in to Snowflake. Follow the sign in prompts.
When you’re successfully signed into Snowflake, the Snowflake button will turn blue and there will be a checkmark in the upper-left corner.
Click Start Session to launch VS Code.
Once everything is ready, you will be able to work with your Snowflake data in VS Code. Since the IDE is provided by the Posit Workbench Native App, your entire analysis will occur securely within Snowflake.
The Quarto and Shiny VS Code extensions support the development of Quarto documents and Shiny apps in VS Code. The Jupyter extension provides support for running Python code in notebook cells.
Install these extensions from the Extensions view in VS Code, clicking Install for each. You can learn more about these extensions here: Shiny extension, Quarto extension.
This Quickstart will step you through the analysis contained in https://github.com/posit-dev/snowflake-posit-quickstart-python/blob/main/quarto.qmd. To follow along, clone the GitHub repo:
Open the Command Palette and select Git: Clone (or click Clone Git Repository from the Explorer view). Paste the URL (https://github.com/posit-dev/snowflake-posit-quickstart-python/) of the GitHub repo into the Command Palette.
Click Clone from URL.
Follow the prompts to authenticate to GitHub.
When prompted, click Open to open the cloned
repository.
Open quarto.qmd. Then, create a virtual environment and install dependencies from the provided requirements.txt file:

1. Open the Command Palette (Ctrl/Cmd+Shift+P), then search for “Python: Create Environment”.
2. Select Venv to create a .venv virtual environment.
3. When prompted, select requirements.txt as the dependencies to install. If you don’t see this prompt, open a terminal and run pip install -r requirements.txt after the environment is created.
4. Click OK. VS Code will install the required packages and create a virtual environment in your current working directory (which should be the folder created from the GitHub repo).

See the Python Environments in VS Code section of the Posit Workbench User Guide to learn more about Python environments in Posit Workbench.
Duration: 5
We’ll run our code in our Python environment, but the code will use data stored in our database on Snowflake.
To access this data, we’ll use the Ibis library to connect to the database and query the data from Python, without having to write raw SQL. Let’s take a look at how this works.
Ibis is an open source dataframe library that works with a wide variety of backends, including Snowflake.
First, we import ibis, then use
ibis.snowflake.connect to connect to the Snowflake
database. We need to provide a warehouse for compute and a
database to connect to. We can also provide a
schema here to make connecting to specific tables
easier.
import ibis
con = ibis.snowflake.connect(
warehouse="DEFAULT_WH",
database="HEART_FAILURE",
schema="PUBLIC",
connection_name="workbench"
)
aside positive
Your
Your warehouse, database, and schema values may differ, depending on your available warehouses and how you named your database.
The variable con now stores our connection.
Once we build a connection, we can use table() to create
an Ibis table
expression that represents the database table.
heart_failure = con.table("HEART_FAILURE")
We can now use Ibis to interact with heart_failure. For
example, we can filter rows and select columns from our data.
heart_failure_filtered = (
heart_failure.filter(heart_failure.AGE < 50)
.rename(
{
"age": "AGE",
"diabetes": "DIABETES",
"serum_sodium": "SERUM_SODIUM",
"serum_creatinine": "SERUM_CREATININE",
"sex": "SEX",
"death_event": "DEATH_EVENT",
}
)
.select(
["age", "diabetes", "serum_sodium", "serum_creatinine", "sex", "death_event"]
)
)
Right now, heart_failure_filtered is still a table
expression. Ibis lazily evaluates commands, which means that the
full query is never run on the database unless explicitly requested.
You can force Ibis to compile the table expression into SQL and run
that SQL on Snowflake with .execute() or
.to_pandas().
heart_failure_filtered.execute()
If we want to see the SQL code that Ibis generates, we can run
ibis.to_sql().
ibis.to_sql(heart_failure_filtered)
SELECT
"t0"."AGE" AS "age",
"t0"."DIABETES" AS "diabetes",
"t0"."SERUM_SODIUM" AS "serum_sodium",
"t0"."SERUM_CREATININE" AS "serum_creatinine",
"t0"."SEX" AS "sex",
"t0"."DEATH_EVENT" AS "death_event"
FROM "HEART_FAILURE" AS "t0"
WHERE
"t0"."AGE" < 50
This system keeps the computation in Snowflake: Ibis translates our Python code into SQL, and Snowflake runs that SQL in its high-performance computing environment. We don’t need to manage the process; it happens automatically behind the scenes.
You can learn more about Ibis here. Take a look at the Snowflake backend documentation to learn more about using Ibis to interact with Snowflake specifically.
Duration: 1
You can also use Ibis to create a new table in a database or append to an existing table.
To add a new table, use create_table().
con.create_table("HEART_FAILURE_FILTERED", heart_failure_filtered)
To insert data into an existing table, use insert().
Now that we understand how to interact with our database, we can use Python to perform our analysis.
Duration: 5
We want to understand which variables in HEART_FAILURE
are associated with survival of patients with heart failure.
First, we rename the columns we’ll use to lowercase so we won’t need to worry about capitalization.
heart_failure = heart_failure.rename(
{
"age": "AGE",
"diabetes": "DIABETES",
"serum_sodium": "SERUM_SODIUM",
"serum_creatinine": "SERUM_CREATININE",
"sex": "SEX",
"death_event": "DEATH_EVENT",
}
)
For now, we’ll focus on patients younger than 50. We also reduce the data to just the columns we’re interested in.
heart_failure_filtered = (
heart_failure
.filter(heart_failure.age < 50) # Filter to age < 50
.select(["age", "diabetes", "serum_sodium", "serum_creatinine", "sex", "death_event"])
)
aside positive
By default, Ibis is in deferred mode and lazily evaluates table expressions. To more easily interact with your tables and see the results of your code, you can turn on interactive code. In interactive mode, expressions are executed when printed to the console. Turn on interactive mode with
ibis.options.interactive = True.
Duration: 5
The heart failure data provides important insights into patient outcomes. Visualizing clinical variables across different patient groups can help us identify patterns associated with survival.
We can use plotnine to visually
compare sodium levels across different patient groups. In this plot, we
see the distribution of serum sodium based on whether the patients have
diabetes and whether they survived (0) or died
(1) during the follow-up period.
heart_failure_df = heart_failure_filtered.execute()
# Convert columns to strings for plot
heart_failure_df['death_event'] = heart_failure_df['death_event'].astype(str)
heart_failure_df['diabetes'] = heart_failure_df['diabetes'].astype(str)
from plotnine import ggplot, aes, geom_boxplot, labs, theme

(
ggplot(heart_failure_df, aes(x='death_event', y='serum_sodium', color='diabetes')) +
geom_boxplot() +
labs(
title="Serum Sodium Levels by Diabetes Status and Survival Outcome",
x="Survival Outcome (0 = Survived, 1 = Died)",
y="Serum Sodium (mEq/L)",
color="Diabetes"
) +
theme(legend_position='bottom')
)
aside positive
We first run
.execute() to force the table expression to execute. This makes it easier to change the column types in preparation for plotting. We don’t need to worry about unnecessarily executing the table expression because plotting requires evaluation anyway.
Duration: 5
Next, we’ll use Ibis to calculate the median values for various clinical metrics across different patient groups.
(
heart_failure_filtered
.group_by(["death_event", "diabetes"])
.aggregate(
median_age=heart_failure_filtered["age"].median(),
median_serum_creatinine=heart_failure_filtered["serum_creatinine"].median(),
median_serum_sodium=heart_failure_filtered["serum_sodium"].median()
)
)
This is a useful way to examine the information for ourselves. However, if we wish to share the information with others, we might prefer to present the table in a more polished format. We can do this with the Great Tables package.
The following code prepares a table named comparison
that we’ll display with Great Tables.
from ibis import _  # deferred expression: `_` refers to the current table in the chain

comparison = (
    heart_failure_filtered
    .group_by(["death_event", "diabetes"])
    .aggregate(
        median_age=_.age.median(),
        median_serum_creatinine=_.serum_creatinine.median().cast("float64"),
        median_serum_sodium=_.serum_sodium.median()
    )
    .mutate(
        death_event=ibis.ifelse(_.death_event == 1, "Died", "Survived"),
        diabetes=ibis.ifelse(_.diabetes == 1, "Yes", "No")
    )
    .rename(
        {
            'Survival Outcome': 'death_event',
            'Diabetes Status': 'diabetes',
            'Median Age': 'median_age',
            'Median Serum Creatinine (mg/dL)': 'median_serum_creatinine',
            'Median Serum Sodium (mEq/L)': 'median_serum_sodium'
        }
    )
)
Next, we use GT() and other Great Tables functions to
create and style a table that displays comparison. Note
that we need to evaluate comparison with
.execute() first because GT() only accepts
Pandas or Polars DataFrames.
(
GT(comparison.execute())
.tab_header(title="Clinical Metrics by Survival Outcome and Diabetes Status")
.fmt_number(
columns=["Median Age", "Median Serum Creatinine (mg/dL)", "Median Serum Sodium (mEq/L)"],
decimals=1
)
.data_color(
columns=["Median Serum Creatinine (mg/dL)", "Median Serum Sodium (mEq/L)"],
palette=["white", "blue"]
)
)
Now that we’ve accumulated some insights, let’s think about how we might present the results of our analysis to our colleagues.
Duration: 2
We’ve conveniently written our analysis in a Quarto
(.qmd) document, quarto.qmd.
Quarto is an open-source publishing
system that makes it easy to create data products such as documents,
presentations, dashboards, websites, and books.
By placing our work in a Quarto document, we have interwoven all of our code, results, output, and prose text into a single literate programming document. This way everything can travel together in a reproducible data product.
A Quarto document can be thought of as a regular markdown document, but with the ability to run code chunks.
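For instance, a minimal .qmd file (a hypothetical sketch, not the actual quarto.qmd from the repository) combines a YAML header, markdown prose, and executable Python chunks:

````markdown
---
title: "Heart failure analysis"
format: html
---

## Serum sodium by outcome

Regular markdown prose goes here.

```{python}
# An executable code chunk
1 + 1
```
````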
You can run any of the code chunks by clicking the
Run Cell button above the chunk in VS Code.
When you run a cell, cell output is displayed in the Jupyter interactive console.
To render and preview the entire document, click the
Preview button or run
quarto preview quarto.qmd from the terminal.
This will run all the code in the document from top to bottom and generate an HTML file, by default, for you to view and share.
You can learn more about Quarto here: https://quarto.org/, and the documentation for all the various Quarto outputs here: https://quarto.org/docs/guide/. Quarto works with Python, R, and Javascript Observable code out-of-the box, and is a great tool to communicate your data science analyses.
Duration: 2
One way to share our work and allow others to explore the heart failure dataset is to create an interactive Shiny app.
We’ve prepared an example Shiny app in the app/ directory of the repository: https://github.com/posit-dev/snowflake-posit-quickstart-python. Our app allows the user to explore different clinical metrics in one place.
To run the app, open app/app.py and then click the Run
Shiny App button at the top of the script in VS Code.
Change the metric in the sidebar to control which metric is plotted.
You can learn more about Shiny at: https://shiny.posit.co/.
If you’re new to Shiny, you can try it online with Shinylive. Shinylive is also available for R.
Duration: 2
Python is a powerful, versatile tool for data science, and combined with Snowflake’s high-performance data capabilities, it enables robust, end-to-end data workflows. Using the Posit Workbench Native Application, you can securely work with Python within Snowflake while taking advantage of tools like Ibis, Quarto, and Shiny for Python to analyze, visualize, and share your results.